Producing an Encyclopedic Dictionary using Patent Documents

Author

  • Atsushi Fujii
Abstract

Although the World Wide Web has of late become an important source to consult for the meaning of words, a number of technical terms related to high technology are not found on the Web. This paper describes a method to produce an encyclopedic dictionary for high-tech terms from patent information. We used a collection of unexamined patent applications published by the Japanese Patent Office as a source corpus. Given this collection, we extracted terms as headword candidates and retrieved applications including those headwords. Then, we extracted paragraph-style descriptions and categorized them into technical domains. We also extracted related terms for each headword. We have produced a dictionary including approximately 400 000 Japanese terms as headwords. We have also implemented an interface with which users can explore our dictionary by reading text descriptions and viewing a related-term graph.
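The stages named in the abstract (headword extraction, retrieval of applications, description extraction, domain categorization, and related-term extraction) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the toy English corpus, the domain labels, the "A <term> is a ..." definition cue, and the use of paragraph co-occurrence to find related terms. The actual system works on Japanese patent applications and may use quite different techniques at each stage.

```python
import re
from collections import Counter, defaultdict

# Hypothetical toy corpus: each application carries an assumed domain label.
APPLICATIONS = [
    {"id": "A1", "domain": "optics",
     "paragraphs": ["A photonic crystal is a periodic structure that controls light.",
                    "The photonic crystal is combined with a waveguide."]},
    {"id": "A2", "domain": "semiconductors",
     "paragraphs": ["A waveguide is a structure that guides waves.",
                    "The substrate supports the waveguide."]},
]

# Assumed cue for a definition-style sentence: "A/An <term> is a/an ..."
DEFINITION = re.compile(r"^An? ([a-z ]+?) is an? ", re.IGNORECASE)


def extract_headwords(apps):
    """Collect headword candidates from definition-style sentences."""
    heads = set()
    for app in apps:
        for paragraph in app["paragraphs"]:
            match = DEFINITION.match(paragraph)
            if match:
                heads.add(match.group(1).lower())
    return heads


def build_dictionary(apps):
    """Group descriptions by domain and count co-occurring headwords."""
    heads = extract_headwords(apps)
    entries = {h: {"descriptions": defaultdict(list), "related": Counter()}
               for h in heads}
    for app in apps:
        for paragraph in app["paragraphs"]:
            text = paragraph.lower()
            # Retrieval step: which headwords does this paragraph mention?
            present = [h for h in heads if h in text]
            for h in present:
                # Categorization step: file the paragraph under the domain.
                entries[h]["descriptions"][app["domain"]].append(paragraph)
                # Related terms: other headwords in the same paragraph.
                for other in present:
                    if other != h:
                        entries[h]["related"][other] += 1
    return entries


if __name__ == "__main__":
    dictionary = build_dictionary(APPLICATIONS)
    for headword, entry in sorted(dictionary.items()):
        print(headword, dict(entry["descriptions"]), dict(entry["related"]))
```

In this sketch the related-term counts could serve as edge weights for a related-term graph like the one the interface displays, although the abstract does not say how that graph is actually built.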

Similar resources

Topic-Driven Multi-Document Summarization with Encyclopedic Knowledge and Spreading Activation

Information of interest to users is often distributed over a set of documents. Users can specify their request for information as a query/topic – a set of one or more sentences or questions. Producing a good summary of the relevant information relies on understanding the query and linking it with the associated set of documents. To “understand” the query we expand it using encyclopedic knowledg...

Thesaurus Expansion using Similar Wor

In both written and spoken languages, we sometimes use different words in order to describe the same meaning. For instance, we use “constraint” (seigen) and “restriction” (seiyaku) as the same meaning. This makes text classification and text summarization difficult. In order to deal with this problem, dictionaries especially thesauri are used. However, in technical paper and patent documents, a...

SemEval-2013 Task 12: Multilingual Word Sense Disambiguation

This paper presents the SemEval-2013 task on multilingual Word Sense Disambiguation. We describe our experience in producing a multilingual sense-annotated corpus for the task. The corpus is tagged with BabelNet 1.1.1, a freely-available multilingual encyclopedic dictionary and, as a byproduct, WordNet 3.0 and the Wikipedia sense inventory. We present and analyze the results of participating sy...

Customizing an English-Korean Machine Translation System for Patent/Technical Documents Translation

This paper addresses a method for customizing an English-Korean machine translation system from general domain to patent or technical document domain. The customizing method includes the followings: (1) adapting the probabilities of POS tagger trained from general domain to the specific domain, (2) syntactically analyzing long and complex sentences by recognizing coordinate structures, and (3) ...

Identification of Chemical Entities in Patent Documents

Biomedical literature is an important source of information for chemical compounds. However, different representations and nomenclatures for chemical entities exist, which makes the reference of chemical entities ambiguous. Many systems already exist for gene and protein entity recognition, however very few exist for chemical entities. The main reason for this is the lack of corpus to train nam...

Publication date: 2008